- 
                Notifications
    
You must be signed in to change notification settings  - Fork 15.1k
 
[AMDGPU] misched: avoid subregister dependencies #140255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| 
          
 @llvm/pr-subscribers-backend-amdgpu Author: Robert Imschweiler (ro-i) ChangesThere are some VOP3P instructions which operate on packed 32bit values and can be configured (op_sel/op_sel_hi) to only use one of the values. This patch adapts the scheduling dependencies so that a write to vgpr3, for example, is not a data dependency for a read from vgpr2_vgpr3 in case only vgpr2 is actually used. @arsenm (see #137549 (comment)) Patch is 40.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140255.diff 5 Files Affected: 
 diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp b/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
index d6153ce93b451..843aca57be2bf 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
@@ -535,6 +535,62 @@ unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF) const {
   return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU());
 }
 
+bool GCNSubtarget::isRealSchedDependency(MachineInstr *DefI, int DefOpIdx,
+                                         MachineInstr *UseI,
+                                         int UseOpIdx) const {
+  if (!InstrInfo.isVOP3P(*UseI))
+    return true;
+  MachineOperand &DefOp = DefI->getOperand(DefOpIdx);
+  if (!DefOp.isReg() || !DefOp.getReg().isPhysical())
+    return true;
+
+  AMDGPU::OpName UseModName;
+  if (AMDGPU::getNamedOperandIdx(UseI->getOpcode(), AMDGPU::OpName::src0) ==
+      UseOpIdx)
+    UseModName = AMDGPU::OpName::src0_modifiers;
+  else if (AMDGPU::getNamedOperandIdx(UseI->getOpcode(),
+                                      AMDGPU::OpName::src1) == UseOpIdx)
+    UseModName = AMDGPU::OpName::src1_modifiers;
+  else if (AMDGPU::getNamedOperandIdx(UseI->getOpcode(),
+                                      AMDGPU::OpName::src2) == UseOpIdx)
+    UseModName = AMDGPU::OpName::src2_modifiers;
+  else
+    return true;
+  MachineOperand *UseOpMod = InstrInfo.getNamedOperand(*UseI, UseModName);
+  if (!UseOpMod)
+    return true;
+  // Check whether all parts of the register are being used (= op_sel and
+  // op_sel_hi differ). In that case we can return early.
+  auto OpSel = UseOpMod->getImm() & SISrcMods::OP_SEL_0;
+  auto OpSelHi = UseOpMod->getImm() & SISrcMods::OP_SEL_1;
+  if ((!OpSel || !OpSelHi) && (OpSel || OpSelHi))
+    return true;
+
+  MachineOperand &UseOp = UseI->getOperand(UseOpIdx);
+  if (!UseOp.isReg() || !UseOp.getReg().isPhysical())
+    return true;
+  const SIRegisterInfo *TRI = getRegisterInfo();
+  const MachineRegisterInfo &MRI = UseI->getParent()->getParent()->getRegInfo();
+  MCRegister DefReg = DefOp.getReg().asMCReg();
+  MCRegister UseReg = UseOp.getReg().asMCReg();
+  // We specifically look for a packed 32bit Use and smaller Def.
+  if (TRI->getRegSizeInBits(UseReg, MRI) != 64 ||
+      TRI->getRegSizeInBits(DefReg, MRI) > 32)
+    return true;
+  SmallVector<MCRegUnit, 2> DefRegUnits(TRI->regunits(DefReg));
+  assert(DefRegUnits.size() <= 2 && "unexpected number of register units");
+  SmallVector<MCRegUnit, 4> UseRegUnits(TRI->regunits(UseReg));
+  assert(UseRegUnits.size() == 4 && "unexpected number of register units");
+
+  auto FindRegunit = [&DefRegUnits](MCRegUnit A, MCRegUnit B) {
+    return llvm::find_if(DefRegUnits, [A, B](MCRegUnit RU) {
+             return RU == A || RU == B;
+           }) != DefRegUnits.end();
+  };
+  return OpSel ? FindRegunit(UseRegUnits[2], UseRegUnits[3])
+               : FindRegunit(UseRegUnits[0], UseRegUnits[1]);
+}
+
 void GCNSubtarget::adjustSchedDependency(
     SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx, SDep &Dep,
     const TargetSchedModel *SchedModel) const {
@@ -545,6 +601,11 @@ void GCNSubtarget::adjustSchedDependency(
   MachineInstr *DefI = Def->getInstr();
   MachineInstr *UseI = Use->getInstr();
 
+  if (!isRealSchedDependency(DefI, DefOpIdx, UseI, UseOpIdx)) {
+    Dep = SDep(Def, SDep::Artificial);
+    return; // this is not a data dependency anymore
+  }
+
   if (DefI->isBundle()) {
     const SIRegisterInfo *TRI = getRegisterInfo();
     auto Reg = Dep.getReg();
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index fea17baa17722..edee53a53a2d0 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -272,6 +272,17 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   SITargetLowering TLInfo;
   SIFrameLowering FrameLowering;
 
+  /// From the (MI300) ISA:
+  /// "Packed 32-bit instructions operate on 2 dwords at a time and those
+  /// operands must be two-dword aligned (i.e. an even VGPR address). Output
+  /// modifiers are not supported for these instructions. OPSEL and OPSEL_HI
+  /// work to select the first or second DWORD for each source."
+  /// -> We can save dependencies on VGPRs by analyzing the operand selection.
+  /// See also
+  /// https://llvm.org/docs/AMDGPUModifierSyntax.html#amdgpu-synid-op-sel
+  bool isRealSchedDependency(MachineInstr *DefI, int DefOpIdx,
+                             MachineInstr *UseI, int UseOpIdx) const;
+
 public:
   GCNSubtarget(const Triple &TT, StringRef GPU, StringRef FS,
                const GCNTargetMachine &TM);
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.exp.large.mir b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.exp.large.mir
index aad6e031aa9ed..6c4ebef38057b 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.exp.large.mir
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.exp.large.mir
@@ -464,16 +464,10 @@
   ; GCN-NEXT:    buffer_load_dwordx2 v[130:131], v64, s[0:3], 0 offen sc0 sc1
   ; GCN-NEXT:    s_waitcnt vmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
-  ; GCN-NEXT:    v_fma_f32 v57, s4, v57, -v134
   ; GCN-NEXT:    v_fma_f32 v48, s4, v48, -v134
-  ; GCN-NEXT:    v_fma_f32 v96, s4, v58, -v134
-  ; GCN-NEXT:    v_mul_f32_e32 v57, 0x3fb8aa3b, v57
   ; GCN-NEXT:    v_mul_f32_e32 v48, 0x3fb8aa3b, v48
   ; GCN-NEXT:    v_fma_f32 v64, s4, v49, -v134
-  ; GCN-NEXT:    v_exp_f32_e32 v163, v57
-  ; GCN-NEXT:    v_mul_f32_e32 v57, 0x3fb8aa3b, v96
   ; GCN-NEXT:    v_fma_f32 v66, s4, v50, -v134
-  ; GCN-NEXT:    v_exp_f32_e32 v164, v57
   ; GCN-NEXT:    v_exp_f32_e32 v49, v48
   ; GCN-NEXT:    v_mul_f32_e32 v48, 0x3fb8aa3b, v64
   ; GCN-NEXT:    v_fma_f32 v67, s4, v51, -v134
@@ -495,25 +489,27 @@
   ; GCN-NEXT:    ds_read_b128 v[140:143], v139
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
+  ; GCN-NEXT:    ds_read_b128 v[144:147], v139 offset:576
+  ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+  ; GCN-NEXT:    buffer_inv sc0 sc1
   ; GCN-NEXT:    v_exp_f32_e32 v54, v48
   ; GCN-NEXT:    v_mul_f32_e32 v48, 0x3fb8aa3b, v70
   ; GCN-NEXT:    v_exp_f32_e32 v55, v48
   ; GCN-NEXT:    v_mul_f32_e32 v48, 0x3fb8aa3b, v71
-  ; GCN-NEXT:    ds_read_b128 v[144:147], v139 offset:576
-  ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-  ; GCN-NEXT:    buffer_inv sc0 sc1
   ; GCN-NEXT:    v_fma_f32 v66, s4, v56, -v134
   ; GCN-NEXT:    v_exp_f32_e32 v56, v48
   ; GCN-NEXT:    v_sub_f32_e32 v48, v65, v134
+  ; GCN-NEXT:    ds_read_b128 v[148:151], v139 offset:1152
+  ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+  ; GCN-NEXT:    buffer_inv sc0 sc1
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v64, v49
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v67, v50
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v68, v51
+  ; GCN-NEXT:    v_fma_f32 v96, s4, v58, -v134
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v58, v52
   ; GCN-NEXT:    v_mul_f32_e32 v48, 0x3fb8aa3b, v48
-  ; GCN-NEXT:    ds_read_b128 v[148:151], v139 offset:1152
-  ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-  ; GCN-NEXT:    buffer_inv sc0 sc1
   ; GCN-NEXT:    v_exp_f32_e32 v48, v48
+  ; GCN-NEXT:    v_fma_f32 v57, s4, v57, -v134
   ; GCN-NEXT:    v_pack_b32_f16 v161, v68, v58
   ; GCN-NEXT:    v_pack_b32_f16 v160, v64, v67
   ; GCN-NEXT:    v_mul_f32_e32 v58, 0x3fb8aa3b, v66
@@ -521,9 +517,7 @@
   ; GCN-NEXT:    ds_read_b128 v[152:155], v139 offset:1728
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
-  ; GCN-NEXT:    v_fma_f32 v162, s4, v61, -v134
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v55
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v56
+  ; GCN-NEXT:    ; implicit-def: $vgpr80_vgpr81_vgpr82_vgpr83_vgpr84_vgpr85_vgpr86_vgpr87_vgpr88_vgpr89_vgpr90_vgpr91_vgpr92_vgpr93_vgpr94_vgpr95
   ; GCN-NEXT:    v_pk_mul_f32 v[64:65], v[64:65], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[66:67], v[66:67], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[68:69], v[68:69], v[48:49] op_sel_hi:[1,0]
@@ -532,10 +526,8 @@
   ; GCN-NEXT:    v_pk_mul_f32 v[74:75], v[74:75], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[76:77], v[76:77], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[78:79], v[78:79], v[48:49] op_sel_hi:[1,0]
-  ; GCN-NEXT:    ; implicit-def: $vgpr80_vgpr81_vgpr82_vgpr83_vgpr84_vgpr85_vgpr86_vgpr87_vgpr88_vgpr89_vgpr90_vgpr91_vgpr92_vgpr93_vgpr94_vgpr95
-  ; GCN-NEXT:    v_fma_f32 v59, s4, v59, -v134
+  ; GCN-NEXT:    v_mul_f32_e32 v57, 0x3fb8aa3b, v57
   ; GCN-NEXT:    v_pk_mul_f32 v[80:81], v[80:81], v[48:49] op_sel_hi:[1,0]
-  ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[64:79], v[140:141], v[160:161], v[64:79]
   ; GCN-NEXT:    v_pk_mul_f32 v[82:83], v[82:83], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[84:85], v[84:85], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[86:87], v[86:87], v[48:49] op_sel_hi:[1,0]
@@ -543,10 +535,20 @@
   ; GCN-NEXT:    v_pk_mul_f32 v[90:91], v[90:91], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[92:93], v[92:93], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[94:95], v[94:95], v[48:49] op_sel_hi:[1,0]
-  ; GCN-NEXT:    ; implicit-def: $vgpr96_vgpr97_vgpr98_vgpr99_vgpr100_vgpr101_vgpr102_vgpr103_vgpr104_vgpr105_vgpr106_vgpr107_vgpr108_vgpr109_vgpr110_vgpr111
+  ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[64:79], v[140:141], v[160:161], v[64:79]
   ; GCN-NEXT:    v_exp_f32_e32 v58, v58
-  ; GCN-NEXT:    v_pk_mul_f32 v[96:97], v[96:97], v[48:49] op_sel_hi:[1,0]
+  ; GCN-NEXT:    v_fma_f32 v162, s4, v61, -v134
+  ; GCN-NEXT:    v_cvt_f16_f32_e32 v61, v55
+  ; GCN-NEXT:    v_cvt_f16_f32_e32 v140, v53
+  ; GCN-NEXT:    v_cvt_f16_f32_e32 v141, v54
+  ; GCN-NEXT:    v_fma_f32 v59, s4, v59, -v134
+  ; GCN-NEXT:    v_fma_f32 v60, s4, v60, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[80:95], v[144:145], v[160:161], v[80:95]
+  ; GCN-NEXT:    v_exp_f32_e32 v163, v57
+  ; GCN-NEXT:    v_mul_f32_e32 v57, 0x3fb8aa3b, v96
+  ; GCN-NEXT:    ; implicit-def: $vgpr96_vgpr97_vgpr98_vgpr99_vgpr100_vgpr101_vgpr102_vgpr103_vgpr104_vgpr105_vgpr106_vgpr107_vgpr108_vgpr109_vgpr110_vgpr111
+  ; GCN-NEXT:    v_pk_mul_f32 v[112:113], v[112:113], v[48:49] op_sel_hi:[1,0]
+  ; GCN-NEXT:    v_pk_mul_f32 v[96:97], v[96:97], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[98:99], v[98:99], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[100:101], v[100:101], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[102:103], v[102:103], v[48:49] op_sel_hi:[1,0]
@@ -554,35 +556,33 @@
   ; GCN-NEXT:    v_pk_mul_f32 v[106:107], v[106:107], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[108:109], v[108:109], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[110:111], v[110:111], v[48:49] op_sel_hi:[1,0]
-  ; GCN-NEXT:    v_pack_b32_f16 v145, v61, v57
-  ; GCN-NEXT:    v_mul_f32_e32 v57, 0x3fb8aa3b, v59
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v140, v53
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v141, v54
-  ; GCN-NEXT:    v_exp_f32_e32 v59, v57
-  ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[96:111], v[148:149], v[160:161], v[96:111]
-  ; GCN-NEXT:    v_fma_f32 v60, s4, v60, -v134
-  ; GCN-NEXT:    v_pk_mul_f32 v[112:113], v[112:113], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[114:115], v[114:115], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[116:117], v[116:117], v[48:49] op_sel_hi:[1,0]
+  ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[96:111], v[148:149], v[160:161], v[96:111]
+  ; GCN-NEXT:    v_exp_f32_e32 v164, v57
+  ; GCN-NEXT:    v_cvt_f16_f32_e32 v57, v56
   ; GCN-NEXT:    v_pk_mul_f32 v[118:119], v[118:119], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[120:121], v[120:121], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[122:123], v[122:123], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[124:125], v[124:125], v[48:49] op_sel_hi:[1,0]
   ; GCN-NEXT:    v_pk_mul_f32 v[126:127], v[126:127], v[48:49] op_sel_hi:[1,0]
+  ; GCN-NEXT:    v_pack_b32_f16 v145, v61, v57
+  ; GCN-NEXT:    v_mul_f32_e32 v57, 0x3fb8aa3b, v59
   ; GCN-NEXT:    v_fma_f32 v148, s4, v62, -v134
-  ; GCN-NEXT:    v_pack_b32_f16 v144, v140, v141
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[112:127], v[152:153], v[160:161], v[112:127]
   ; GCN-NEXT:    v_fma_f32 v152, s4, v63, -v134
+  ; GCN-NEXT:    v_pack_b32_f16 v144, v140, v141
+  ; GCN-NEXT:    v_exp_f32_e32 v59, v57
   ; GCN-NEXT:    v_mul_f32_e32 v149, 0x3fb8aa3b, v60
   ; GCN-NEXT:    ; implicit-def: $vgpr57
   ; GCN-NEXT:    ds_read_b128 v[60:63], v57
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
-  ; GCN-NEXT:    v_exp_f32_e32 v160, v149
   ; GCN-NEXT:    v_fma_f32 v161, s4, v33, -v134
   ; GCN-NEXT:    v_mul_f32_e32 v33, 0x3fb8aa3b, v148
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v153, v58
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[64:79], v[142:143], v[144:145], v[64:79]
+  ; GCN-NEXT:    v_exp_f32_e32 v160, v149
+  ; GCN-NEXT:    v_cvt_f16_f32_e32 v153, v58
   ; GCN-NEXT:    v_fma_f32 v32, s4, v32, -v134
   ; GCN-NEXT:    ds_read_b128 v[140:143], v57 offset:576
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
@@ -590,22 +590,20 @@
   ; GCN-NEXT:    v_fma_f32 v40, s4, v40, -v134
   ; GCN-NEXT:    v_fma_f32 v44, s4, v44, -v134
   ; GCN-NEXT:    v_fma_f32 v16, s4, v16, -v134
-  ; GCN-NEXT:    v_fma_f32 v166, s4, v20, -v134
-  ; GCN-NEXT:    v_fma_f32 v24, s4, v24, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[80:95], v[146:147], v[144:145], v[80:95]
   ; GCN-NEXT:    v_mul_f32_e32 v146, 0x3fb8aa3b, v162
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v147, v163
   ; GCN-NEXT:    v_exp_f32_e32 v162, v146
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v146, v164
-  ; GCN-NEXT:    v_fma_f32 v28, s4, v28, -v134
+  ; GCN-NEXT:    v_fma_f32 v166, s4, v20, -v134
   ; GCN-NEXT:    v_pack_b32_f16 v148, v153, v147
-  ; GCN-NEXT:    v_fma_f32 v0, s4, v0, -v134
+  ; GCN-NEXT:    v_fma_f32 v24, s4, v24, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[96:111], v[150:151], v[144:145], v[96:111]
   ; GCN-NEXT:    v_exp_f32_e32 v151, v33
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v33, v59
   ; GCN-NEXT:    v_fma_f32 v150, s4, v34, -v134
-  ; GCN-NEXT:    v_fma_f32 v8, s4, v8, -v134
-  ; GCN-NEXT:    v_fma_f32 v12, s4, v12, -v134
+  ; GCN-NEXT:    v_fma_f32 v28, s4, v28, -v134
+  ; GCN-NEXT:    v_fma_f32 v0, s4, v0, -v134
   ; GCN-NEXT:    v_pack_b32_f16 v149, v146, v33
   ; GCN-NEXT:    v_mul_f32_e32 v33, 0x3fb8aa3b, v152
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[112:127], v[154:155], v[144:145], v[112:127]
@@ -614,6 +612,8 @@
   ; GCN-NEXT:    v_fma_f32 v155, s4, v36, -v134
   ; GCN-NEXT:    v_perm_b32 v36, v158, v156, s5
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v154, v160
+  ; GCN-NEXT:    v_fma_f32 v8, s4, v8, -v134
+  ; GCN-NEXT:    v_fma_f32 v12, s4, v12, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[64:79], v[60:61], v[148:149], v[64:79]
   ; GCN-NEXT:    v_mul_f32_e32 v60, 0x3fb8aa3b, v32
   ; GCN-NEXT:    ds_read_b128 v[32:35], v57 offset:1152
@@ -787,12 +787,14 @@
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v45, v158
   ; GCN-NEXT:    v_perm_b32 v21, v148, v144, s5
   ; GCN-NEXT:    v_perm_b32 v37, v148, v144, s8
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v63
   ; GCN-NEXT:    ;;#ASMSTART
   ; GCN-NEXT:    s_waitcnt vmcnt(8)
   ; GCN-NEXT:    ;;#ASMEND
   ; GCN-NEXT:    buffer_wbl2 sc0 sc1
   ; GCN-NEXT:    ds_write_b64 v135, v[20:21]
+  ; GCN-NEXT:    buffer_wbl2 sc0 sc1
+  ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+  ; GCN-NEXT:    ds_write_b64 v136, v[36:37]
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[96:111], v[16:17], v[40:41], v[96:111]
   ; GCN-NEXT:    v_perm_b32 v16, v141, v131, s5
   ; GCN-NEXT:    v_fma_f32 v131, s4, v22, -v134
@@ -802,23 +804,19 @@
   ; GCN-NEXT:    v_perm_b32 v17, v149, v145, s5
   ; GCN-NEXT:    buffer_wbl2 sc0 sc1
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-  ; GCN-NEXT:    ds_write_b64 v136, v[36:37]
+  ; GCN-NEXT:    ds_write_b64 v137, v[16:17]
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[112:127], v[32:33], v[40:41], v[112:127]
   ; GCN-NEXT:    v_pack_b32_f16 v33, v45, v22
   ; GCN-NEXT:    v_mul_f32_e32 v22, 0x3fb8aa3b, v60
   ; GCN-NEXT:    v_exp_f32_e32 v144, v22
-  ; GCN-NEXT:    buffer_wbl2 sc0 sc1
-  ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-  ; GCN-NEXT:    ds_write_b64 v137, v[16:17]
   ; GCN-NEXT:    ; implicit-def: $vgpr17
   ; GCN-NEXT:    ; implicit-def: $vgpr22
+  ; GCN-NEXT:    v_cvt_f16_f32_e32 v44, v63
   ; GCN-NEXT:    buffer_wbl2 sc0 sc1
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    ds_write_b64 v138, v[42:43]
   ; GCN-NEXT:    v_add_u32_e32 v22, v132, v22
   ; GCN-NEXT:    v_add_u32_e32 v17, v132, v17
-  ; GCN-NEXT:    ; implicit-def: $vgpr20
-  ; GCN-NEXT:    ; implicit-def: $vgpr21
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    buffer_load_dwordx2 v[40:41], v22, s[0:3], 0 offen sc0 sc1
   ; GCN-NEXT:    s_waitcnt vmcnt(0)
@@ -826,9 +824,11 @@
   ; GCN-NEXT:    buffer_load_dwordx2 v[42:43], v17, s[0:3], 0 offen sc0 sc1
   ; GCN-NEXT:    s_waitcnt vmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
+  ; GCN-NEXT:    ; implicit-def: $vgpr20
+  ; GCN-NEXT:    ; implicit-def: $vgpr21
+  ; GCN-NEXT:    v_pack_b32_f16 v32, v61, v44
   ; GCN-NEXT:    v_add_u32_e32 v20, v132, v20
   ; GCN-NEXT:    v_add_u32_e32 v21, v132, v21
-  ; GCN-NEXT:    v_pack_b32_f16 v32, v61, v44
   ; GCN-NEXT:    buffer_load_dwordx2 v[44:45], v20, s[0:3], 0 offen sc0 sc1
   ; GCN-NEXT:    s_waitcnt vmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
@@ -959,27 +959,27 @@
   ; GCN-NEXT:    buffer_wbl2 sc0 sc1
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    ds_write_b64 v136, v[20:21]
+  ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[112:127], v[16:17], v[24:25], v[112:127]
+  ; GCN-NEXT:    v_pack_b32_f16 v17, v40, v6
+  ; GCN-NEXT:    v_mul_f32_e32 v6, 0x3fb8aa3b, v32
   ; GCN-NEXT:    buffer_wbl2 sc0 sc1
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    ds_write_b64 v137, v[0:1]
   ; GCN-NEXT:    buffer_wbl2 sc0 sc1
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    ds_write_b64 v138, v[26:27]
-  ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[112:127], v[16:17], v[24:25], v[112:127]
-  ; GCN-NEXT:    v_pack_b32_f16 v17, v40, v6
-  ; GCN-NEXT:    v_mul_f32_e32 v6, 0x3fb8aa3b, v32
+  ; GCN-NEXT:    v_exp_f32_e32 v25, v6
   ; GCN-NEXT:    ;;#ASMSTART
   ; GCN-NEXT:    s_waitcnt vmcnt(8)
   ; GCN-NEXT:    ;;#ASMEND
   ; GCN-NEXT:    v_pack_b32_f16 v16, v37, v28
   ; GCN-NEXT:    v_fma_f32 v24, s4, v7, -v134
-  ; GCN-NEXT:    v_exp_f32_e32 v25, v6
+  ; GCN-NEXT:    v_mul_f32_e32 v0, 0x3fb8aa3b, v149
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    ds_read_b128 v[4:7], v139
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[64:79], v[22:23], v[16:17], v[64:79]
-  ; GCN-NEXT:    v_mul_f32_e32 v0, 0x3fb8aa3b, v149
   ; GCN-NEXT:    v_exp_f32_e32 v26, v0
   ; GCN-NEXT:    v_mul_f32_e32 v0, 0x3fb8aa3b, v29
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v150
@@ -998,13 +998,13 @@
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v0, v25
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[112:127], v[18:19], v[16:17], v[112:127]
   ; GCN-NEXT:    v_pack_b32_f16 v17, v2, v0
-  ; GCN-NEXT:    v_pack_b32_f16 v16, v1, v27
   ; GCN-NEXT:    v_mul_f32_e32 v0, 0x3fb8aa3b, v24
-  ; GCN-NEXT:    v_fma_f32 v18, s4, v11, -v134
+  ; GCN-NEXT:    v_pack_b32_f16 v16, v1, v27
   ; GCN-NEXT:    v_exp_f32_e32 v19, v0
   ; GCN-NEXT:    ds_read_b128 v[0:3], v139 offset:1152
   ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
   ; GCN-NEXT:    buffer_inv sc0 sc1
+  ; GCN-NEXT:    v_fma_f32 v18, s4, v11, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[64:79], v[4:5], v[16:17], v[64:79]
   ; GCN-NEXT:    v_mul_f32_e32 v4, 0x3fb8aa3b, v8
   ; GCN-NEXT:    ds_read_b128 v[8:11], v139 offset:1728
@@ -1013,41 +1013,41 @@
   ; GCN-NEXT:    v_exp_f32_e32 v24, v4
   ; GCN-NEXT:    v_mul_f32_e32 v4, 0x3fb8aa3b, v28
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v5, v26
-  ; GCN-NEXT:    v_exp_f32_e32 v27, v4
-  ; GCN-NEXT:    v_mul_f32_e32 v4, 0x3fb8aa3b, v18
+  ; GCN-NEXT:    v_fma_f32 v28, s4, v14, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[80:95], v[20:21], v[16:17], v[80:95]
+  ; GCN-NEXT:    v_exp_f32_e32 v27, v4
   ; GCN-NEXT:    v_cvt_f16_f32_e32 v20, v29
+  ; GCN-NEXT:    v_mul_f32_e32 v4, 0x3fb8aa3b, v18
   ; GCN-NEXT:    v_fma_f32 v21, s4, v13, -v134
-  ; GCN-NEXT:    v_fma_f32 v28, s4, v14, -v134
   ; GCN-NEXT:    v_mfma_f32_32x32x8_f16 v[96:111], v[0:1], v[16:17], v[96:111]
   ; GCN-NEXT:    v_mul_f32_e32 v0, 0x3fb8aa3b, v30
-  ; GCN-NEXT:    v_cvt_f16_f32_e32 v1, v31
   ; GCN-NEXT:    v_exp_f32_e32 v30, v0
+  ; GCN-NEXT: ...
[truncated]
 | 
    
| 
           Ping (see questions above)  | 
    
| 
           Ping  | 
    
| 
           Ping.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The physreg handling is a bit off here, but it's the less important case. It's more important to get the virtual register case correct first
4cc7dc2    to
    a550d40      
    Compare
  
    | 
          
 ✅ With the latest revision this PR passed the C/C++ code formatter.  | 
    
a550d40    to
    512c730      
    Compare
  
    512c730    to
    5c66d77      
    Compare
  
    | 
           fixed merge conflict in   | 
    
| 
           I feel like there might be a conflict with c256966#diff-542d3e260c6e545f29ad24eb096fcfb90a0b6a4cbc08e613971c26451098c1fb. Needs further investigation.  | 
    
          
 The test probably should be more targeted than it is  | 
    
| if (RegA.isVirtual() && RegB.isVirtual()) { | ||
| if (RegA != RegB) | ||
| return false; | ||
| LaneBitmask LA = getSubRegIndexLaneMask(SubA); | ||
| LaneBitmask LB = getSubRegIndexLaneMask(SubB); | ||
| return (LA & LB).any(); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure overlap is a well formed question for virtual registers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, it checks for overlapping subregs. Why would this make less sense to check for virtual regs than for physical ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the interference is an assignment decision for the virtual register. It's context dependent on the lane mask and position in the function
| TRI->getSubRegisterClass(RC, SubRegIdx)) | ||
| return SubRegIdx; | ||
| if (unsigned SubRegIdx = OpSel ? AMDGPU::hi16 : AMDGPU::lo16; | ||
| TRI->getSubRegisterClass(RC, SubRegIdx)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably should verify getSubRegisterClass returned RC, otherwise you would need a new constrainRegClass call somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I slightly misunderstood the semantics of getSubRegisterClass since it seems to be based on the second part of the returned pair from CodeGenRegisterClass::getMatchingSubClassWithSubRegs (see RegisterInfoEmitter.cpp), which means that the returned class is more restrictive as it needs to be, which makes it unsuitable to compare to RC.
I think getSubClassWithSubReg would be the more appropriate call if we want to check that the return value matches RC because it's based on CodeGenRegisterClass::getSubClassWithSubReg, which returns the largest subclass where all registers have the given subidx (also see RegisterInforEmitter.cpp).
However, while testing, I discovered that, for example, pk_mul_virtual_16_lo_lo_16_lo_lo or pk_mul_physical_16_lo_lo_16_lo_lo in my tests have the VS_32  for %0:vgpr_32/$vgpr0. If I now use getSubClassWithSubReg on them (with AMDGPU::lo16 as subreg idx), I get VS_32_with_hi16 back because, apparently, not all registers in VS_32 support lo16.
I found that one could either use TRI.getConstrainedRegClassForOperand (virtual) and TRI.getPhysRegBaseClass (physical), or TRI.getRegClassForOperandReg. The relevant differences are afaics that getRegClassForOperandReg already handles given subregs (à la %0.sub0) and that getConstrainedRegClassForOperand returns the largest allocatable class by also calling getAllocatableClass, which is probably not needed here (?). That's why I opted for TRI.getRegClassForOperandReg, which then also finally allows the results of TRI.getSubClassWithSubReg to be compared to RC.
There are some VOP3P instructions which operate on packed 32bit values and can be configured (op_sel/op_sel_hi) to only use one of the values. This patch adapts the scheduling dependencies so that a write to vgpr3, for example, is not a data dependency for a read from vgpr2_vgpr3 in case only vgpr2 is actually used.
b5d4fb9    to
    517da79      
    Compare
  
    | 
           fixed new merge conflict in   | 
    
| 
           A cleaner way to implement this would be to change the MachineInstr representation of these instructions so that the def/use operands represent only the registers (or only the 32-bit parts of 64-bit register tuples) that are actually read/written by the instruction. Then you would not need this extensive hacking in   | 
    
| 
           But aren't the true 16 instructions dedicated instructions? Here, the instructions that deal with 16-bit values, for example, do take 32-bit operands and then select the relevant parts (which may be one or both parts) according to op_sel/op_sel_hi. So, the instruction using only the lower half of their Nth operand, the instruction using only the upper half, and the instruction using both halves are essentially the same instruction, just with different op_sel/op_sel_hi values  | 
    
          
 No. True16 is just a compiler mode switch that affects how we do register allocation. The hardware instructions are identical - the hardware does not know or care whether the compiler ran in true16 mode or not.  | 
    
There are some VOP3P instructions which operate on packed 32bit values and can be configured (op_sel/op_sel_hi) to only use one of the values. This patch adapts the scheduling dependencies so that a write to vgpr3, for example, is not a data dependency for a read from vgpr2_vgpr3 in case only vgpr2 is actually used.
@arsenm (see #137549 (comment))